FIX: [PERF] Handle utf16 strings using simdutf and std::u16string#526
FIX: [PERF] Handle utf16 strings using simdutf and std::u16string#526ffelixg wants to merge 16 commits into
Conversation
|
I have been evaluating the external library simdutf as a high performance replacement for utf16 -> utf8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than only using it for the arrow fetch path, I have been trying to make the switch for every location where one of these three functions is used, as the applications follow similar patterns. I didn't use simdutf in every case, another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from python. I think this avoids the whole issue where wchars are defined as 32 bit on some OSes but SQLWCHARs are always 16 bit. This brings arrow performance on linux for nvarchars in line with what it should be. If the std::u16string type works as well as I hope it does (will have to see what CI says about mac), there are some more spots where it could be used, for example to replace |
There was a problem hiding this comment.
Pull request overview
This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).
Changes:
- Add
simdutf(viafind_packageorFetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths. - Replace a number of
std::wstringusages withstd::u16stringfor connection strings, queries, and SQL_C_WCHAR parameter buffers. - Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
mssql_python/pybind/CMakeLists.txt |
Adds simdutf dependency resolution and links it into the ddbc_bindings module. |
mssql_python/pybind/ddbc_bindings.h |
Introduces UTF-16 helpers (dupeSqlWCharAsUtf16Le, utf16LeToUtf8Alloc) and changes ErrorInfo to store UTF-8 std::string. |
mssql_python/pybind/ddbc_bindings.cpp |
Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + simdutf. |
mssql_python/pybind/connection/connection.h |
Updates connection APIs/state to store connection strings as std::u16string. |
mssql_python/pybind/connection/connection.cpp |
Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly. |
mssql_python/pybind/connection/connection_pool.h |
Updates pooling key types and APIs to use std::u16string. |
mssql_python/pybind/connection/connection_pool.cpp |
Implements pooling with std::u16string connection string keys. |
mssql_python/pybind/unix_utils.h |
Removes the SQLWCHARToWString declaration. |
mssql_python/pybind/unix_utils.cpp |
Removes the SQLWCHARToWString implementation. |
Comments suppressed due to low confidence (1)
mssql_python/pybind/ddbc_bindings.cpp:583
- In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses
data()+SQL_NTSand setsbufferLengthtosize() * sizeof(SQLWCHAR). ForSQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (preferc_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.
Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.
std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
paramBuffers, param.cast<std::u16string>());
LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
"length=%zu characters, buffer=%zu bytes",
paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
dataPtr = sqlwcharBuffer->data();
bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
*strLenOrIndPtr = SQL_NTS;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/ddbc_bindings.cppLines 1542-1550 1542 LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
1543 ErrorInfo errorInfo;
1544 if (retcode == SQL_INVALID_HANDLE) {
1545 LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1546 errorInfo.ddbcErrorMsg = "Invalid handle!";
1547 return errorInfo;
1548 }
1549 assert(handle != 0);
1550 SQLHANDLE rawHandle = handle->get();Lines 1617-1625 1617 return records;
1618 }
1619
1620 // Wrap SQLExecDirect
! 1621 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
1622 LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
1623 "query_length=%zu chars",
1624 (void*)StatementHandle->get(), Query.length());
1625 if (!SQLExecDirect_ptr) {Lines 1634-1642 1634 SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
1635 (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
1636 }
1637
! 1638 SQLWCHAR* queryPtr = reinterpretU16stringAsSqlWChar(Query);
1639 SQLRETURN ret;
1640 {
1641 // Release the GIL during the blocking ODBC call so that other Python
1642 // threads (e.g. asyncio event loop, heartbeat threads) can run whileLines 3749-3757 3749 sizeof(DateTimeOffset) * fetchSize,
3750 buffers.indicators[col - 1].data());
3751 break;
3752 default:
! 3753 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
3754 std::ostringstream errorString;
3755 errorString << "Unsupported data type for column - " << columnName.c_str()
3756 << ", Type - " << dataType << ", column ID - " << col;
3757 LOG("SQLBindColums: %s", errorString.str().c_str());Lines 3758-3766 3758 ThrowStdException(errorString.str());
3759 break;
3760 }
3761 if (!SQL_SUCCEEDED(ret)) {
! 3762 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
3763 std::ostringstream errorString;
3764 errorString << "Failed to bind column - " << columnName.c_str() << ", Type - "
3765 << dataType << ", column ID - " << col;
3766 LOG("SQLBindColums: %s", errorString.str().c_str());Lines 4091-4099 4091 break;
4092 }
4093 default: {
4094 const auto& columnMeta = columnNames[col - 1].cast<py::dict>();
! 4095 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
4096 std::ostringstream errorString;
4097 errorString << "Unsupported data type for column - " << columnName.c_str()
4098 << ", Type - " << dataType << ", column ID - " << col;
4099 LOG("FetchBatchData: %s", errorString.str().c_str());Lines 4193-4201 4193 case SQL_SS_TIMESTAMPOFFSET:
4194 rowSize += sizeof(DateTimeOffset);
4195 break;
4196 default:
! 4197 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
4198 std::ostringstream errorString;
4199 errorString << "Unsupported data type for column - " << columnName.c_str()
4200 << ", Type - " << dataType << ", column ID - " << col;
4201 LOG("calculateRowSize: %s", errorString.str().c_str());mssql_python/pybind/utf_utils.h📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.8%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.error.h: 2.1%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.9%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16.h: 18.2%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 28.7%
mssql_python.pybind.build._deps.simdutf-src.src.simdutf.haswell.simd16-inl.h: 33.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 61.1%
mssql_python.pybind.build._deps.simdutf-src.src.generic.utf16.utf8_length_from_utf16_bytemask.h: 61.5%🔗 Quick Links
|
|
Only issue seems to be that some Linux CI containers don't have git. I'm trying to fetch simdutf via url instead. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@gargsaumya The second CI error was due to the ubuntu image using an older version of cmake. Should be fine now I hope. I have also replaced the remaining std::wstring occurences with std::u16string and eliminated helper functions as well as platform specific code accordingly. Let me know if you are happy with this holistic update to utf16 string handling or if you would prefer to keep it contained to the arrow fetch path. |
|
Thanks for the update @ffelixg. I actually prefer this and it LGTM overall. ButI’ll still take a closer look at the changes and get back to you. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
I think that this latest CI failure was due to a flaky test. The same test / platform passed in a past CI run and I don't think my changes affected that test. Could you rerun that one @gargsaumya? |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
gargsaumya
left a comment
There was a problem hiding this comment.
Really appreciate the effort here, this is a very meaningful contribution @ffelixg. I've reviewed the PR and left 2 inline comments, please take a look!
|
Glad you like it @gargsaumya! I totally agree with both your comments and pushed changes accordingly. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Seems like the same flaky test failed again |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
I have a few generic comments before I delve into core review:
|
| result.push_back(static_cast<wchar_t>(UNICODE_REPLACEMENT_CHAR)); | ||
| } | ||
| } | ||
| inline std::string utf16LeToUtf8Alloc(const std::u16string& utf16) { |
There was a problem hiding this comment.
Would recommend to move utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, and reinterpretU16stringAsSqlWChar from ddbc_bindings.h into a small utf_utils.h so they can be unit tested and the changes don't force recompilation of the whole module.
I do not see dedicated unit tests for utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, or reinterpretU16stringAsSqlWChar. Edge cases (empty strings, surrogate pairs, lone surrogates) are only tested indirectly through integration tests. Once moved to utf_utils.h I would suggest to add targeted unit tests.
There was a problem hiding this comment.
By targeted unit tests, am I correct in assuming that you mean pure c++ unit tests? Testing them via python wrappers and pytest feels wrong. It would also involve changes to the build process as well as CI pipelines, since there is currently no infrastructure for c++ tests. The predecessors of these functions were also only tested via integration or even fully inlined as was the case for some reinterpretU16stringAsSqlWChar applications.
There was a problem hiding this comment.
Good point! I agree that setting up a full C++ unit test framework and CI pipeline changes is out of scope for this PR. Let's not block this PR on it, but I'd like to open a tracking issue for it in our internal backlog for future enhancements.
I see you've already added utf_utils.h, thank you! Let's keep moving with this PR. We are expecting one more review on this before we merge it.
There was a problem hiding this comment.
@gargsaumya please track this: "but I'd like to open a tracking issue for it in our internal backlog for future enhancements" in ADO.
Co-authored-by: Copilot <copilot@github.com>
|
Thanks for the review @sumitmsft. Regarding the first point, There's actually one more method that's currently being used to convert unicode aside from simdutf and I hope the added dependency is not too much of an inconvenience. I looked into native iconv as well, but didn't find it appealing in comparison. The simdutf releases look to me like they are mostly about adding new features, which we likely won't care about. Since unicode conversion itself is a solved problem, I was hoping that we would technically be fine using the same version forever, assuming no crazy bugs. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Work Item / Issue Reference
Summary